- 
            Diffusion policies have achieved superior performance in imitation learning and offline reinforcement learning (RL) due to their rich expressiveness. However, the conventional diffusion training procedure requires samples from the target distribution, which is impossible in online RL since we cannot sample from the optimal policy. Backpropagating the policy gradient through the diffusion process incurs large computational costs and instability, making it expensive and not scalable. To enable efficient training of diffusion policies in online RL, we generalize conventional denoising score matching by reweighting the loss function. The resulting Reweighted Score Matching (RSM) preserves the optimal solution and low computational cost of denoising score matching, while eliminating the need to sample from the target distribution and allowing the policy to learn to optimize value functions. We introduce two tractable reweighted loss functions to solve two commonly used policy optimization problems, policy mirror descent and max-entropy policy optimization, resulting in two practical algorithms named Diffusion Policy Mirror Descent (DPMD) and Soft Diffusion Actor-Critic (SDAC). We conducted comprehensive comparisons on MuJoCo benchmarks. The empirical results show that the proposed algorithms outperform recent diffusion-policy online RL methods on most tasks, and that DPMD improves by more than 120% over Soft Actor-Critic on Humanoid and Ant.
            Free, publicly-accessible full text available July 13, 2026
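To make the reweighting idea concrete, the sketch below applies a per-sample weight to a standard denoising score matching loss so that high-value actions dominate the regression target. The exponentiated-value weighting, the network signature, and the hyperparameters are illustrative assumptions, not the exact RSM/DPMD/SDAC formulation.

```python
# Hypothetical sketch of a reweighted denoising score matching loss (PyTorch).
# The softmax-of-Q weighting is an illustrative choice, not the paper's exact scheme.
import torch

def reweighted_dsm_loss(score_net, states, actions, q_values, sigma=0.1, alpha=1.0):
    """Denoising score matching on replay actions, reweighted by estimated value."""
    noise = torch.randn_like(actions)
    noisy_actions = actions + sigma * noise
    # Score of the Gaussian perturbation kernel at the noisy sample: -noise / sigma.
    target_score = -noise / sigma
    pred_score = score_net(states, noisy_actions, sigma)   # assumed signature
    per_sample = ((pred_score - target_score) ** 2).sum(dim=-1)
    # Reweight each sample so high-value actions contribute more to the fit.
    with torch.no_grad():
        weights = torch.softmax(q_values / alpha, dim=0) * q_values.shape[0]
    return (weights * per_sample).mean()
```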
- 
            Li, Yingzhen; Mandt, Stephan; Agrawal, Shipra; Khan, Emtiyaz (Eds.)
            Off-policy evaluation (OPE) is one of the most fundamental problems in reinforcement learning (RL): estimating the expected long-term payoff of a given target policy using only experiences from another, potentially unknown, behavior policy. The distribution correction estimation (DICE) family of estimators has advanced the state of the art in OPE by breaking the curse of horizon. However, the major bottleneck in applying DICE estimators lies in the difficulty of solving the saddle-point optimization involved, especially with neural network implementations. In this paper, we tackle this challenge by establishing a linear representation of the value function and the stationary distribution correction ratio, i.e., the primal and dual variables in the DICE framework, using the spectral decomposition of the transition operator. This primal-dual representation not only bypasses the non-convex, non-concave optimization in vanilla DICE, thereby enabling a computationally efficient algorithm, but also paves the way for more efficient utilization of historical data. We highlight that our algorithm, SpectralDICE, is the first to leverage a linear representation of the primal-dual variables that is both computationally and sample efficient, and its performance is supported by a rigorous theoretical sample complexity guarantee and a thorough empirical evaluation on various benchmarks.
            Free, publicly-accessible full text available May 3, 2026
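To illustrate why a linear primal-dual representation matters, here is a minimal sketch of generic linear DICE: with shared features phi(s, a) for the value function (primal) and the correction ratio (dual), the Bellman-flow constraint collapses to a small linear system instead of a saddle-point problem. The feature construction, regularization, and names are assumptions for illustration; this is not the SpectralDICE algorithm or its spectral features.

```python
# Hedged sketch: linear-feature DICE reduces off-policy evaluation to a linear solve.
import numpy as np

def linear_dice_ope(phi_sa, phi_next_sa, phi_init_sa, rewards, gamma, reg=1e-6):
    """Estimate the normalized discounted return (1 - gamma) * E[sum gamma^t r_t].

    phi_sa:      (N, d) features of logged (s, a) pairs
    phi_next_sa: (N, d) features of (s', a') with a' drawn from the target policy
    phi_init_sa: (M, d) features of (s0, a0) with a0 drawn from the target policy
    rewards:     (N,)   logged rewards
    """
    N, d = phi_sa.shape
    # A = E_D[ phi(s,a) (gamma * phi(s',a') - phi(s,a))^T ]
    A = phi_sa.T @ (gamma * phi_next_sa - phi_sa) / N
    b = phi_init_sa.mean(axis=0)
    # Dual weights alpha solve A^T alpha = -(1 - gamma) b (ridge-regularized).
    alpha = np.linalg.solve(A.T + reg * np.eye(d), -(1.0 - gamma) * b)
    # Correction ratios w(s,a) = phi(s,a)^T alpha; OPE estimate is E_D[w * r].
    w = phi_sa @ alpha
    return float(np.mean(w * rewards))
```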
- 
            Abstract: Using the FIRE-2 cosmological zoom-in simulations, we investigate the temporal evolution of gas-phase metallicity radial gradients of Milky Way–mass progenitors in the redshift range 0.4 < z < 3. We pay special attention to the occurrence of positive (i.e., inverted) metallicity gradients, where metallicity increases with galactocentric radius. This trend, contrary to the more commonly observed negative radial gradients, has frequently been seen in recent spatially resolved grism observations. The rate of occurrence of positive gradients in FIRE-2 is about ∼7% for 0.4 < z < 3 and ∼13% at higher redshifts (1.5 < z < 3), broadly consistent with observations. Moreover, we investigate the correlations among galaxy metallicity gradient, stellar mass, star formation rate (SFR), and degree of rotational support. Metallicity gradients show a strong correlation with both sSFR and the rotational-to-dispersion velocity ratio (v_c/σ), implying that starbursts and the kinematic morphology of galaxies play significant roles in shaping these gradients. The FIRE-2 simulations indicate that galaxies with high sSFR and weak rotational support (v_c/σ ≲ 1) are more likely, by ∼15%, to develop positive metallicity gradients. This trend is attributed to galaxy-scale gas flows driven by stellar feedback, which effectively redistribute metals within the interstellar medium. Our results support the important role of stellar feedback in governing the chemo-structural evolution and disk formation of Milky Way–mass galaxies at the cosmic noon epoch.
            Free, publicly-accessible full text available June 17, 2026
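For readers unfamiliar with the quantity being tracked, a radial metallicity gradient is simply the slope of gas-phase metallicity against galactocentric radius; a positive slope is an inverted profile. The sketch below shows that measurement on mock data. The binning, units, and variable names are illustrative and do not reproduce the FIRE-2 analysis pipeline.

```python
# Minimal sketch: metallicity gradient as the slope of a linear fit in dex/kpc.
import numpy as np

def metallicity_gradient(radius_kpc, log_oh):
    """Return d(12 + log(O/H)) / dR in dex/kpc; positive means an inverted profile."""
    slope, _intercept = np.polyfit(radius_kpc, log_oh, deg=1)
    return slope

# Example: a mock galaxy whose metallicity rises outward (positive gradient).
rng = np.random.default_rng(0)
r = np.linspace(0.5, 10.0, 50)                       # galactocentric radius [kpc]
z = 8.4 + 0.02 * r + rng.normal(0.0, 0.01, r.size)   # 12 + log(O/H)
print(f"gradient = {metallicity_gradient(r, z):+.3f} dex/kpc")
```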
- 
            A nonlocal phase-field crystal (NPFC) model is presented as a nonlocal counterpart of the local phase-field crystal (LPFC) model and a special case of the structural PFC (XPFC) derived from classical field theory for crystal growth and phase transitions. The NPFC incorporates a finite range of spatial nonlocal interactions that can account for both repulsive and attractive effects. The specific form is data-driven, determined by fitting to the material's structure factor, and can be much more accurate than the LPFC and the previously proposed fractional variant. In particular, it is able to match the experimental structure factor up to the second peak, an achievement not possible with other PFC variants studied in the literature. Both the LPFC and the fractional PFC (FPFC) are also shown to be distinct scaling limits of the NPFC, which reflects its generality. The NPFC's advantage in retaining material properties suggests that it may be more suitable for characterizing liquid–solid transition systems. Moreover, we study numerical discretizations using Fourier spectral methods, which are shown to be convergent and asymptotically compatible, and therefore robust across different parameter ranges. Numerical experiments are given in the two-dimensional case to demonstrate the effectiveness of the NPFC in simulating crystal structures and grain boundaries.
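As background for the Fourier spectral discretization mentioned above, the sketch below implements a first-order semi-implicit spectral step for the standard local PFC equation. It is a generic illustration of the spectral machinery only: the NPFC would replace the local (1 + ∇²)² operator with its data-driven nonlocal kernel, which is not shown, and the scheme and parameters here are assumptions rather than the paper's discretization.

```python
# Hedged sketch: semi-implicit Fourier spectral step for the local PFC equation
#   d(psi)/dt = Laplacian( (-eps + (1 + Laplacian)^2) psi + psi^3 )
# Linear terms are treated implicitly in Fourier space, the cubic term explicitly.
import numpy as np

def pfc_step(psi, dt=0.1, eps=0.25, L=32 * np.pi):
    n = psi.shape[0]
    k = 2 * np.pi * np.fft.fftfreq(n, d=L / n)
    kx, ky = np.meshgrid(k, k, indexing="ij")
    k2 = kx**2 + ky**2
    lin = -eps + (1.0 - k2) ** 2              # Fourier symbol of -eps + (1 + Laplacian)^2
    psi_hat = np.fft.fft2(psi)
    nonlin_hat = np.fft.fft2(psi**3)
    psi_hat = (psi_hat - dt * k2 * nonlin_hat) / (1.0 + dt * k2 * lin)
    return np.real(np.fft.ifft2(psi_hat))

# Usage: relax a noisy field toward a crystalline pattern.
rng = np.random.default_rng(0)
psi = -0.25 + 0.02 * rng.standard_normal((128, 128))
for _ in range(200):
    psi = pfc_step(psi)
```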
- 
            Aligning large language models (LLMs) with human objectives is crucial for real-world applications. However, fine-tuning LLMs for alignment often suffers from unstable training and requires substantial computing resources. Test-time alignment techniques, such as prompting and guided decoding, do not modify the underlying model, and their performance remains dependent on the original model's capabilities. To address these challenges, we propose aligning LLMs through representation editing. The core of our method is to view a pre-trained autoregressive LLM as a discrete-time stochastic dynamical system. To achieve alignment with specific objectives, we introduce external control signals into the state space of this language dynamical system. We train a value function directly on the hidden states according to the Bellman equation, enabling gradient-based optimization to obtain the optimal control signals at test time. Our experiments demonstrate that our method outperforms existing test-time alignment techniques while requiring significantly fewer resources than fine-tuning methods. Our code is available at https://github.com/Lingkai-Kong/RE-Control.
            Free, publicly-accessible full text available December 9, 2025
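The sketch below illustrates the test-time control step described above: a learned value head scores a hidden state, and gradient ascent on that score yields a control signal that is added to the state before decoding continues. The update rule, names, and hyperparameters are assumptions for illustration; training the value head from Bellman targets is omitted, and this is not the RE-Control implementation.

```python
# Hedged sketch of test-time representation editing with a frozen LLM.
import torch

def edit_hidden_state(hidden, value_head, step_size=0.5, n_steps=5):
    """Gradient-ascent on a learned value head to obtain a control signal.

    hidden:     (batch, d) final-layer hidden states of the frozen LLM
    value_head: module mapping (batch, d) -> (batch, 1) value estimates
    """
    control = torch.zeros_like(hidden, requires_grad=True)
    for _ in range(n_steps):
        value = value_head(hidden + control).sum()
        grad, = torch.autograd.grad(value, control)
        # Ascend the value estimate; the LLM weights are never modified.
        control = (control + step_size * grad).detach().requires_grad_(True)
    return (hidden + control).detach()
```

The edited state would then be passed to the unchanged LM head to produce the next-token logits.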